Abstract

Background: Modern genomic technologies produce large amounts of data that can be mapped to specific 
regions in the genome. Among the first steps in interpreting the results is annotation of genomic regions with 
known features such as genes, promoters, CpG islands etc. Several tools have been published to perform this task. 
However, using these tools often requires a significant amount of bioinformatics skills and/or downloading and 
installing dedicated software. 

Results: Here we present AnnotateGenomicRegions, a web application that accepts genomic regions as input and 
outputs a selection of overlapping and/or neighboring genome annotations. Supported organisms include human 
(hg18, hg19), mouse (mm8, mm9, mm10), zebrafish (danRer7), and Saccharomyces cerevisiae (sacCer2, sacCer3).
AnnotateGenomicRegions is accessible online on a public server or can be installed locally. Some frequently used 
annotations and genomes are embedded in the application while custom annotations may be added by the user. 

Conclusions: The increasing spread of genomic technologies generates the need for a simple-to-use annotation 
tool for genomic regions that can be used by biologists and bioinformaticians alike. AnnotateGenomicRegions 
meets this demand. AnnotateGenomicRegions is an open-source web application that can be installed on any 
personal computer or institute server. AnnotateGenomicRegions is available at:
http://cru.genomics.iit.it/AnnotateGenomicRegions.



Background 

A common denominator for all applications of Next
Generation Sequencing technology is the need to annotate
genomic regions of interest. This task is usually performed
by bioinformaticians who prepare the data as custom
tracks for genome browsers and use a set of additional
tools to produce tabular annotations to be scrutinized by
biologists. Using these tools often requires considerable
bioinformatics skills and/or downloading and installing
dedicated software. Some published tools include
functional annotation, for example CisGenome, W-ChIPeaks,
Sole-Search, or CASSys [1-4]. These tools focus on the
identification of enriched regions in chromatin
immunoprecipitation sequencing (ChIP-seq) experiments
and provide annotation of genomic regions only as a
secondary feature. Therefore, using these tools for
annotation purposes only is cumbersome. Command-line
tools such as BEDTools [5] are very powerful at identifying
overlapping regions in two files provided in browser
extensible data (BED) format. Details on this format can be
found at https://genome.ucsc.edu/FAQ/FAQformat.html.
Being command-line tools, however, they are hard for
many biologists to use. The same is true for the
BioConductor ChIPpeakAnno package [6]. Tools such as the
EnsEMBL Ruby API [7] require considerable programming
skills, which precludes widespread use by biologists.

Galaxy [8] is a sophisticated web-based suite of genome
analysis tools that can also perform annotation of genomic
regions as part of the "Operate on Genomic Intervals"
menu option. It is an expert tool that requires some
familiarity. The option "Fetch closest non-overlapping
feature" will find annotations that are defined as
"neighbors" in this work. The file defining the neighbors
must be uploaded along with the query regions. No
default annotations for neighbor fetching are provided.
Only one annotation can be fetched at a time.
Identification of overlapping features requires the use of a
different menu option ("Intersect"). The UCSC table browser
[9] has the functionalities required to annotate sets of
genomic regions. However, the input is restricted to
1,000 regions, which makes this tool cumbersome to use
for annotating the results of large genomic experiments.

A widely accepted, web-based annotation tool that suits
bioinformaticians and biologists of widely varying skill
levels is not yet available. Here we present
AnnotateGenomicRegions [10], a web application that
accepts genomic regions as input and outputs overlapping
and/or neighboring genome annotations chosen on
a simple web form.

Implementation 

AnnotateGenomicRegions has been developed using
Java Enterprise technology on the NetBeans 7.1
Integrated Development Environment (http://netbeans.org/)
and runs on a Glassfish version 3.1 web server
(http://glassfish.java.net/). We also successfully tested
other Java Enterprise Edition servers such as Apache
TomEE 1.5.2 (http://tomee.apache.org), JBOSS community
edition 6.1 (http://www.jboss.org/), and WebSphere
community edition 3.0
(http://www-03.ibm.com/software/products/us/en/appserv-wasce/).
The Java Server Faces (https://javaserverfaces.java.net/)
and PrimeFaces (http://primefaces.org/) frameworks have
been chosen for rendering the graphical user interface.
Apache Maven (http://maven.apache.org/) is used as a
software management tool. AnnotateGenomicRegions
relies on a set of Java beans to process the
annotation queries and returns the annotations as
zipped, tab-delimited tables. A set of tutorials and
examples is provided to allow the user to get started quickly.
The annotations are kept on the server for two hours
before being deleted. AnnotateGenomicRegions is a
Sourceforge project and can be downloaded from
http://sourceforge.net/projects/annotatelocus/ along with
detailed descriptions of input and output formats.

AnnotateGenomicRegions provides the annotations and
genomes most frequently requested by biologists working
with the developers of the tool. Currently, the annotations
comprise Refseq transcripts [11], EnsEMBL transcripts
[12], all_mrna transcripts [9], CpG islands, and promoter
regions of transcripts. Promoter regions are defined as
the regions extending 1 kb upstream and 1 kb downstream
of the corresponding transcription start site. The
annotations are downloaded from the UCSC genome browser
[13], formatted, sorted by chromosome, start position,
and end position, and incorporated in the annotation
pipeline. A comprehensive list of annotations and genomes
for the release of October 2012 (Oct2012) is shown in
Additional file 1. Users are welcome to use the "Contact"
form to request additional annotations or genomes that we
may incorporate in the online application. Annotations
will be updated on a yearly basis. AnnotateGenomicRegions
does not aim to provide a comprehensive list of
annotations for all available genome assemblies. Instead,
it permits uploading customized annotations.
Furthermore, researchers who need annotations not
included in AnnotateGenomicRegions are encouraged to
run their own local and customized installation.
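
The promoter definition given above can be illustrated with a short Python sketch. It is not taken from the AnnotateGenomicRegions source code: the function name `promoter_region`, the tuple layout, and the clamping of the left boundary at position 0 are assumptions made for illustration, and coordinates are treated as plain integers without committing to a 0-based or 1-based convention.

```python
def promoter_region(chrom, start, end, strand, flank=1000):
    """Return (chrom, promoter_start, promoter_end) around the TSS.

    The transcription start site (TSS) is taken as the transcript start
    on the plus strand and the transcript end on the minus strand.
    `flank` is the number of bases taken on each side of the TSS.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    tss = start if strand == "+" else end
    # Assumption: clamp at 0 so promoters near a chromosome start stay valid.
    return chrom, max(0, tss - flank), tss + flank


# Hypothetical transcript coordinates, once per strand.
print(promoter_region("chr1", 69090, 70008, "+"))  # ('chr1', 68090, 70090)
print(promoter_region("chr1", 69090, 70008, "-"))  # ('chr1', 69008, 71008)
```

Because the TSS depends on the strand, the same transcript coordinates yield different promoter regions on the plus and minus strands.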

Results and discussion 

Conscious of the need for an easy-to-use application for
annotating genomic regions, we have developed and
made available online AnnotateGenomicRegions, a fast
web application that accepts a list of genomic
regions and displays the corresponding annotations on a
web page. The results can be exported in a tab-delimited
format recognized by spreadsheet programs.

AnnotateGenomicRegions annotates sets of genomic
regions of interest with overlapping and/or neighboring
features that are mapped to the genome. Genomic
regions of interest can be derived from experiments such
as ChIP-seq, DNase I hypersensitive sites sequencing
(DNase-seq), methylation profiling using reduced
representation bisulfite sequencing (Methyl-RRBS),
quantitation of small RNAs by massively parallel sequencing
(smallRNA-seq), resequencing etc., or might be derived
from in-silico screens such as regions harboring a given
DNA motif. A question that typically needs to be
answered early in the analysis regards the relation of the
experimentally defined regions with known features in
the genome. AnnotateGenomicRegions answers it by
reporting, for each query region, the overlapping and/or
neighboring annotations found in the user-provided and
embedded annotation sets. The definitions of
overlaps and neighborhood of genomic regions are
described in Figure 1.

We consider two genomic regions in Figure 1: One
region represents a query (shown in red) and the other
region represents an annotation (shown in blue). Both
regions have a start and an end position. Figure 1A lists
the possible relations between the start and the end
positions of the two genomic regions. We distinguish three
possible relations between any two positions: larger than,
equal to, and smaller than. From these relations follow all
types of overlaps that can be observed during the
annotation process, which are shown in Figure 1B. There are 16
possible types of overlaps between the query and the
annotation regions, including overlaps of regions with
length zero. Such regions are often found in data sets on
genome variation and might describe insertion points, for
example. Figure 1C depicts what is meant by neighboring
annotations in AnnotateGenomicRegions. Neighboring
annotations do not overlap the query region as
defined in Figure 1B and are closest to the query region.
Here, "closest" does not refer to a physical distance
threshold on the chromosome. It only means that there is
no other genomic region in between the query region and
the neighboring annotation.

Figure 1 Overlap and neighbor queries. A) For two genomic regions
located on the same chromosome, the possible relations (>, =, <)
between start and end positions of the elements are listed. Note that
not all configurations are realistic, e.g. start x > end x is forbidden by
definition. B) Comprehensive overview of all overlap types between
queries (red bars) and annotations (blue bars), including
overlaps of genomic elements of size zero (cases 8, 13, 14, 15, 16). C)
Neighbor annotations are defined as the closest annotation to the
query that is not overlapping the query in the sense defined by panel B.
Left neighbors are located upstream and right neighbors are located
downstream of the query element. For left neighbors, the distance is
measured between query start and annotation end. For right neighbors,
the distance is measured between query end and annotation start.
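
The overlap and neighbor definitions of Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's Java implementation: the `Region` type and the function names are hypothetical, and regions are treated as closed intervals so that zero-length regions and shared endpoints count as overlaps. The exact boundary convention is fixed by Figure 1B, which is not reproduced here.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str = ""


def overlaps(query: Region, ann: Region) -> bool:
    """Closed-interval overlap test (assumed convention)."""
    return (query.chrom == ann.chrom
            and query.start <= ann.end
            and ann.start <= query.end)


def neighbors(query: Region, annotations: Sequence[Region]):
    """Return the closest non-overlapping annotation on each side.

    Each side is reported as (annotation, distance) or None. Distances
    follow the caption of Figure 1: query start minus annotation end for
    left neighbors, annotation start minus query end for right neighbors.
    """
    left: Optional[Region] = None
    right: Optional[Region] = None
    for ann in annotations:
        if ann.chrom != query.chrom or overlaps(query, ann):
            continue
        if ann.end < query.start:
            # Upstream candidate: keep the one ending nearest to the query.
            if left is None or ann.end > left.end:
                left = ann
        elif right is None or ann.start < right.start:
            # Downstream candidate: keep the one starting nearest to the query.
            right = ann
    left_hit: Optional[Tuple[Region, int]] = (
        (left, query.start - left.end) if left else None)
    right_hit: Optional[Tuple[Region, int]] = (
        (right, right.start - query.end) if right else None)
    return left_hit, right_hit


# Hypothetical annotations and query.
anns = [Region("chr1", 100, 200, "a"), Region("chr1", 300, 400, "b"),
        Region("chr1", 450, 500, "c"), Region("chr1", 900, 1000, "d")]
query = Region("chr1", 420, 460, "q")
print([a.name for a in anns if overlaps(query, a)])
print(neighbors(query, anns))
```

In this example the query overlaps annotation "c"; "b" is its left neighbor at distance 20 and "d" its right neighbor at distance 440.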

AnnotateGenomicRegions has been designed to satisfy
three use-cases that are shown in Figure 2. The use-cases
differ slightly in the type of input that is required from
the user. In the simplest case, the user submits a set of
genomic regions and annotates them with annotations that
are embedded in AnnotateGenomicRegions (Figure 2A).
Naturally, this mode of using the tool is limited to the
embedded genome annotations. The output is a table
that lists, for each query region, the overlapping and/or
neighboring annotations chosen by the user.

In cases where the embedded annotations are not
sufficient, users may upload their own annotations.
The uploaded files must be in BED format. As shown in
Figure 2B, in this scenario the user has to upload both
the query regions and all of the required annotation
files. The output is a table that lists, for each query
region, the overlapping and/or neighboring annotations
found in each of the uploaded annotation files.

Sometimes the user may wish to know the distance
between a query region and a neighboring annotation.
This often happens when transcription factors are studied
and the distance to the nearest transcription start site is of
interest. Figure 2C shows the input that the user must
provide to obtain distance annotations. The only
difference from the use-case described in Figure 2B is that
the uploaded annotation files must contain strand
information. The output then lists the name of the annotation
and the corresponding distance. Strand information is
required because distances of interest often refer to the
5' end or the 3' end of an annotation. By convention, the
5' end is identical to the start base of an annotation on
the plus strand and to the end base of an annotation on
the minus strand (vice versa for 3' ends).

Figure 3 shows screenshots of AnnotateGenomicRegions.
On the "Annotate" pane (Figure 3A) the user is
invited to choose the annotation release, the genome, the
desired features for annotation, and whether the
annotations of the neighboring regions should be reported. The
query regions should be pasted or uploaded in BED format
or as position coordinates (chromosome:start-end). Upon
submitting the query, the results are displayed in tabular
form (Figure 3B), can be downloaded in zip format, and
can be pasted into a spreadsheet program such as Microsoft
Excel or LibreOffice. A hyperlink allows displaying each
region in the UCSC genome browser [13].
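
The caption of Figure 3 lists the accepted spellings of position coordinates (case-insensitive chromosome prefix, optional thousands separators, colon and minus optionally replaced by white space). A tolerant parser for such input might look like the following Python sketch. The regular expression and the function name `parse_region` are assumptions for illustration, not the parser used by AnnotateGenomicRegions, and only the "chr" prefix is normalized to lower case.

```python
import re

# chromosome, then ':' or white space, start, then '-' or white space, end
_REGION = re.compile(
    r"^\s*(chr\w+)[\s:]+([\d,]+)[\s-]+([\d,]+)\s*$", re.IGNORECASE)


def parse_region(text):
    """Parse one query line into (chrom, start, end)."""
    match = _REGION.match(text)
    if match is None:
        raise ValueError(f"not a genomic region: {text!r}")
    chrom, start, end = match.groups()
    chrom = "chr" + chrom[3:]          # normalize the prefix only
    # Drop thousands separators before converting to integers.
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


for line in ("chr1 7577506-7577606",
             "Chr2:7,577,506-7,577,606",
             "CHR3:7577506-7,577,606",
             "chr4\t7577506\t7577606"):
    print(parse_region(line))
```

Accepting tab characters as separators is what makes it possible to paste three spreadsheet columns directly into the query field.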

For non-standard annotations, a "CUSTOM" menu
option has been provided. Here, the user can upload
annotation files in BED format along with the queries.
The user chooses the number of desired annotation
files, browses to the local files containing the
annotations, specifies the column indices for chromosome,
start, end, and annotation name, and chooses whether
overlap or neighbor queries are desired. When the
queries are submitted, the annotations are uploaded to the
server and processed for fast annotation, and the results
are provided as a zipped output file. When the correct
genome assembly is chosen on the web form, the regions
may be viewed in the UCSC genome browser.

Figure 2 Use cases. The three use cases are shown schematically. In all cases, the output can be downloaded and pasted into a spreadsheet.
A) User provides query regions and selects desired annotations embedded in AnnotateGenomicRegions. Output lists overlapping and/or
neighboring annotations as defined by the user. B) User provides query regions and annotations. Output lists overlapping and/or neighboring
annotations as defined by the user. C) User provides query regions and annotations with strand information. Output lists neighboring
annotations and corresponding distance.
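
Reading a custom annotation file with user-specified column indices, as described for the "CUSTOM" option, can be sketched in Python as follows. The function name `read_annotations`, the 0-based column indices, and the skipping of `track`, `browser`, and comment lines are assumptions for illustration; the web form may number columns differently.

```python
import csv
import io


def read_annotations(handle, chrom_col=0, start_col=1, end_col=2, name_col=3):
    """Read tab-delimited annotations into (chrom, start, end, name) tuples."""
    regions = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip empty lines and BED header lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        regions.append((row[chrom_col], int(row[start_col]),
                        int(row[end_col]), row[name_col]))
    return regions


# Hypothetical file whose name column comes first.
text = "track name=example\nNM_001005484\tchr1\t69090\t70008\n"
print(read_annotations(io.StringIO(text),
                       chrom_col=1, start_col=2, end_col=3, name_col=0))
```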

Finally, it is possible to calculate the distance between 
a query region and a given genome annotation. This 
task can be accomplished using the "DISTANCE" menu 
option. The annotations used for distance calculations 
must be provided by the user in the format: 

"Chromosome start end strand annotation" 

The user can calculate the distance to the 5' end, the 3'
end, the start, the end, or the center position of the
annotation. The distance is calculated to the start, the
end, or the center position of the query region, as chosen
by the user.
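
The choice of reference positions described above can be sketched in Python as follows. This is an illustration under assumptions, not the tool's implementation: the anchor names, the function names, the integer rounding of the center position, and the sign convention (annotation position minus query position) are all hypothetical, since the text does not state them.

```python
def annotation_position(start, end, strand, anchor):
    """Reference position of an annotation for a given anchor."""
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "center":
        return (start + end) // 2      # assumption: round down
    if anchor == "5prime":
        return start if strand == "+" else end
    if anchor == "3prime":
        return end if strand == "+" else start
    raise ValueError(f"unknown annotation anchor: {anchor!r}")


def query_position(start, end, anchor):
    """Reference position of a query region (start, end, or center)."""
    positions = {"start": start, "end": end, "center": (start + end) // 2}
    if anchor not in positions:
        raise ValueError(f"unknown query anchor: {anchor!r}")
    return positions[anchor]


def distance(query, annotation, query_anchor="center",
             annotation_anchor="5prime"):
    """Signed distance: annotation position minus query position."""
    q_start, q_end = query
    a_start, a_end, strand = annotation
    return (annotation_position(a_start, a_end, strand, annotation_anchor)
            - query_position(q_start, q_end, query_anchor))


# Hypothetical query (1000-1200) and annotation (2000-3000).
print(distance((1000, 1200), (2000, 3000, "+")))  # 900
print(distance((1000, 1200), (2000, 3000, "-")))  # 1900
```

The two results differ only because the 5' end moves from the start base to the end base when the annotation lies on the minus strand.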

The speed of the annotation process using
AnnotateGenomicRegions was compared to the speed of BEDTools
[5]. BEDTools is popular in the bioinformatics community
and provides command-line functionality that permits
annotating a set of query regions with one or more
annotation files. BEDTools uses C++ libraries for annotation
that were originally developed for the UCSC genome
browser [5,13]. To speed up the annotation process, these
libraries apply a binning scheme to the genomic regions
of the annotations and build an index on the bins
obtained. AnnotateGenomicRegions, on the other hand,
uses hash tables for each chromosome holding the
annotations sorted by start position. Auxiliary hash tables
then memorize the position of the last annotation found
to be overlapping or neighboring a query region. Since
the query regions are sorted by start position, the
auxiliary hash tables help minimize the number of
times a given annotation is visited. Figure 4 shows the
response times of AnnotateGenomicRegions and of
BEDTools when used for annotating between 10 and 500,000
query regions with one or ten different annotations.
AnnotateGenomicRegions was faster than BEDTools
throughout: about two times faster for up to 100 query
regions, and 30 times (one annotation) or 80 times (ten
annotations) faster for 500,000 query
regions. The response times for up to 300,000 query



Zammataro et al. BMC Bioinformatics 2014, 15(Suppl 1):S8 
http://www.biomedcentral.com/1471-2105/15/S1/S8 




Figure 3 Screenshots of AnnotateGenomicRegions. A) Annotate pane. Here, the user uploads query regions, selects the desired release, the
genome, and the embedded annotations of interest, defines whether multiple hits shall be displayed, and submits the query. Query regions can be
uploaded by pasting them directly, by uploading a file, or by providing a hyperlink to a file in the appropriate format. Query regions can be
given as position coordinates such as chr1 7577506-7577606 or Chr2:7,577,506-7,577,606 or CHR3:7577506-7,577,606
(case insensitive, with or without thousands separators, colon between chromosome and start, minus between start and end). Colon and minus
characters are not compulsory; white space characters (space or tabulator) can be used instead. Allowing tabulator
characters as separators permits pasting query regions directly from a spreadsheet holding chromosome, start, and end positions in separate
columns. Note that query regions can be duplicated. These duplications represent alternatively spliced transcripts with identical start and stop
positions but with different transcript identifiers. Duplicated regions will be annotated independently from each other as if they were not
duplicated. B) Output example. The output lists all the annotations found for each query region. In this example, all RefSeq Symbols for
transcription units overlapping the query regions are shown, separated by two forward slash ("/") characters. A hyperlink for downloading the
latest results is shown. The hyperlinks corresponding to queries executed earlier are displayed in the Annotation history. Each query region is
hyperlinked for display in the UCSC genome browser. Duplicated query regions are listed independently from each other and will display
identical annotations. Keeping the duplicates keeps the output aligned with the input so that the output can easily be pasted side by side
with annotations obtained previously. Note that the results page contains a filter function ("annotation results filter"). Typing in the
associated text field will filter out all rows that do not contain the typed word in any column. This feature is useful if a researcher is interested in
a given gene, for example. Filters can be applied separately to each column by typing in the text fields labeled as "region" or "annotations".
Figure 4 Comparison of AnnotateGenomicRegions and
BEDTools. AnnotateGenomicRegions and BEDTools were used to
annotate identical query regions with identical annotation files. The
number of query regions was varied between 10 and 500,000.
AnnotateGenomicRegions was found to be two
times faster for up to 100 query regions. For 500,000 query regions,
AnnotateGenomicRegions was found to be 30 times faster for 1
annotation (A) and 80 times faster for 10 annotations (B). The tests
were performed on an Ubuntu Server Linux computer with 12
CPUs and 16 GB RAM.



regions remain below 10 seconds. This number of query
regions exceeds several-fold the number of regions to be
annotated in a typical genomic experiment.
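
The indexing idea described for AnnotateGenomicRegions (per-chromosome hash tables of annotations sorted by start position, plus an auxiliary table that remembers where the previous query left off) can be sketched in Python as follows. This is a simplified illustration of the general technique and not the tool's Java code: the function names are hypothetical, only overlaps are reported, and regions are treated as closed intervals.

```python
from collections import defaultdict


def build_index(annotations):
    """Map each chromosome to its annotations sorted by start position."""
    index = defaultdict(list)
    for chrom, start, end, name in annotations:
        index[chrom].append((start, end, name))
    for regions in index.values():
        regions.sort()
    return index


def annotate_overlaps(queries, index):
    """Report overlapping annotation names for queries sorted by start."""
    cursor = defaultdict(int)   # auxiliary table: resume position per chromosome
    results = []
    for chrom, q_start, q_end in sorted(queries):
        regions = index.get(chrom, [])
        i = cursor[chrom]
        # Annotations ending before this query cannot overlap it or any
        # later query on the chromosome, so they are skipped for good.
        while i < len(regions) and regions[i][1] < q_start:
            i += 1
        cursor[chrom] = i
        hits = []
        j = i
        # Scan forward only while annotations start within the query.
        while j < len(regions) and regions[j][0] <= q_end:
            if regions[j][1] >= q_start:
                hits.append(regions[j][2])
            j += 1
        results.append(((chrom, q_start, q_end), hits))
    return results


# Hypothetical annotations and queries.
index = build_index([("chr1", 100, 200, "a"), ("chr1", 150, 400, "b"),
                     ("chr1", 500, 600, "c"), ("chr2", 10, 20, "d")])
for region, names in annotate_overlaps(
        [("chr1", 450, 550), ("chr1", 180, 190), ("chr2", 30, 40)], index):
    print(region, names)
```

Because both lists are sorted, the cursor only moves forward, which is what limits the number of times a given annotation is visited.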

Conclusions 

AnnotateGenomicRegions is a web application that
allows researchers with a wide range of bioinformatics
skills to annotate genomic regions of interest, e.g.
ChIP-seq peaks, with overlapping or neighboring annotations.
In contrast to other tools, AnnotateGenomicRegions
can be used easily by non-experts. The annotations are
provided as tab-delimited text files that can be pasted
into a spreadsheet. Query regions are hyperlinked for
viewing them in the UCSC Genome Browser.
Commonly used annotations such as Refseq transcripts [11],
EnsEMBL transcripts [12], or CpG islands are
downloaded regularly from the UCSC genome browser
repository and made available for instant annotation of
genomic regions of interest. Users are invited to leave
their feedback using the contact form to improve the
software or to participate in future developments of the
tool.

Availability and requirements 

♦ Project name: AnnotateGenomicRegions
♦ Project home page: http://cru.genomics.iit.it/AnnotateGenomicRegions
♦ Operating system(s): Platform independent
♦ Programming language: Java
♦ Other requirements: Java 1.6 or higher, Glassfish 3.1 or higher
♦ License: Apache license
♦ Any restrictions to use by non-academics: no restrictions

Additional material 



Additional file 1: Annotations and genomes provided by
AnnotateGenomicRegions. The annotation name column shows the file
name that is displayed by the web application. The UCSC download file
name lists the name of the file at the UCSC genome browser repository
that is used as the annotation source. hg19, hg18, mm10, mm9, mm8,
danRer7, sacCer3, and sacCer2 denote the genome assemblies for which
annotations are embedded in the web application. The region
description column holds a short description of the annotation file
content. The region name example shows examples of the names
associated with each genomic region.



Abbreviations 

BED: browser extensible data; ChIP-seq: Chromatin Immunoprecipitation
followed by massively parallel sequencing; danRer7: July 2010 zebrafish
(Danio rerio) Zv9 assembly produced by The Wellcome Trust Sanger
Institute; DNase-seq: DNase I hypersensitive sites sequencing; hg18: March
2006 human reference sequence (NCBI Build 36.1); hg19: February 2009
human reference sequence (Genome Reference Consortium Human Build
37); Methyl-RRBS: genome-wide methylation profiling using reduced
representation bisulfite sequencing; mm10: December 2011 Mus musculus
assembly (Genome Reference Consortium Mouse Build 38); mm8: February
2006 Mus musculus assembly (Genome Reference Consortium Mouse Build
36); mm9: July 2007 Mus musculus assembly (Genome Reference
Consortium Mouse Build 37); Oct2012: October 2012 release of
AnnotateGenomicRegions embedded annotations; sacCer2: June 2008
Saccharomyces cerevisiae genome assembly based on sequence in the
Saccharomyces Genome Database; sacCer3: April 2011 Saccharomyces
cerevisiae genome assembly based on sequence in the Saccharomyces
Genome Database; smallRNA-seq: quantitation of small RNAs by massively
parallel sequencing.